Selecting feature subset for high dimensional data via the propositional FOIL rules

نویسندگان

  • Guangtao Wang
  • Qinbao Song
  • Baowen Xu
  • Yuming Zhou
چکیده

Feature interaction is an important issue in feature subset selection. However, most of the existing algorithms only focus on dealing with irrelevant and redundant features. In this paper, a propositional FOIL rule based algorithm FRFS, which not only retains relevant features and excludes irrelevant and redundant ones but also considers feature interaction, is proposed for selecting feature subset for high dimensional data. FRFS first merges the features appeared in the antecedents of all FOIL rules, achieving a candidate feature subset which excludes redundant features and reserves interactive ones. Then, it identifies and removes irrelevant features by evaluating features in the candidate feature subset with a new metric CoverRatio, and obtains the final feature subset. The efficiency and effectiveness of FRFS are extensively tested upon both synthetic and real world data sets, and it is compared with other six representative feature subset selection algorithms, including CFS, FCBF, Consistency, Relief-F, INTERACT, and the rule-based FSBAR, in terms of the number of selected features, runtime and the classification accuracies of the four well-known classifiers including Naive Bayes, C4.5, PART and IB1 before and after feature selection. The results on the five synthetic data sets show that FRFS can effectively identify irrelevant and redundant features while reserving interactive ones. The results on the 35 real world high dimensional data sets demonstrate that compared with other six feature selection algorithms, FRFS cannot only efficiently reduce the feature space, but also can significantly improve the performance of the four well-known classifiers. & 2012 Elsevier Ltd. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts

High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...

متن کامل

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can significantly be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not have decent performance in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...

متن کامل

Improvement of effort estimation accuracy in software projects using a feature selection approach

In recent years, utilization of feature selection techniques has become an essential requirement for processing and model construction in different scientific areas. In the field of software project effort estimation, the need to apply dimensionality reduction and feature selection methods has become an inevitable demand. The high volumes of data, costs, and time necessary for gathering data , ...

متن کامل

A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum

Introduction: Amajor problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reaction within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting the proteomic patterns from biological samples. A major challenge in extracting such ...

متن کامل

تعیین ماشین‌های بردار پشتیبان بهینه در طبقه‌بندی تصاویر فرا طیفی بر مبنای الگوریتم ژنتیک

Hyper spectral remote sensing imagery, due to its rich source of spectral information provides an efficient tool for ground classifications in complex geographical areas with similar classes. Referring to robustness of Support Vector Machines (SVMs) in high dimensional space, they are efficient tool for classification of hyper spectral imagery. However, there are two optimization issues which s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Pattern Recognition

دوره 46  شماره 

صفحات  -

تاریخ انتشار 2013